ABSTRACT

Documents represented as bitmap images are transformed into coded
textual data and coded graphics data by graphics and textual
recognizers, which use a standard notation for recording the results of
the document recognition processes, including any ambiguities, in a
document description language. Recognized portions of the document,
represented as editable coded data, such as for example ASCII, are
placed in elements, defined in the document description language, with
all contents of an element sharing some common characteristic. Elements
can include, for example: character-string-elements,
questionable-character-elements, questionable-word-elements,
verified-word-elements, alternate-word-elements, segment-elements, and
arc-elements.

20 Claims, 14 Drawing Sheets 
§ Etymologies

Etymologies appear in square brackets [ ] following
the "definitions". In accordance with the Dictionary's†
policy of eliminating special symbols & dingbats...

• Obscure Origin. According to the noted
linguist Jumblatt, the origins of numerous English
words are still obscure ...

† The American Heritage Dictionary

FIG. 1
<!ELEMENT s   - O  (#PCDATA)  --char string-- >

FIG. 2

<!ELEMENT qc  - O  (#PCDATA)  --questionable char-- >

FIG. 3

<!ELEMENT qw  - O  (#PCDATA)  --questionable word-- >

FIG. 4

<!ELEMENT vw  - O  (#PCDATA)  --verified word-- >
<!ELEMENT aw  - O  (vw, vw+)  --alternate words-- >

FIG. 5
<!ELEMENT text  - O  (s | aw | qc | qw)+ >
<!ATTLIST text
    id    ID     #REQUIRED
    font  IDREF  #CURRENT >

FIG. 6



<!ELEMENT fontDef  - O  (#PCDATA)  --font family name-->
<!ATTLIST fontDef
    id       ID                   #REQUIRED
    size     NUMBER               12      --font size in points--
    weight   (ultraL | extraL | light |
              semiL | medium | semiB |
              bold | extraB | ultraB)
                                  medium  --font weight--
    posture  (roman | italic)     roman   --printing type style--
    base     (normal | sub | sup) normal
    under    NUMBER               0       --underline position if any-- >

FIG. 7
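
The fontDef attribute list above gives every attribute except id a default value. The following is a minimal sketch, in Python, of how a consumer of the notation might fill in omitted attributes and reject values outside the declared choices. The names FONTDEF_DEFAULTS, FONTDEF_CHOICES and effective_fontdef are hypothetical and are not part of the notation itself; the weight list follows the reconstruction of FIG. 7 above.

```python
# Defaults and enumerated choices transcribed from the fontDef ATTLIST (FIG. 7).
# All names in this sketch are illustrative assumptions.
FONTDEF_DEFAULTS = {
    "size": "12",
    "weight": "medium",
    "posture": "roman",
    "base": "normal",
    "under": "0",
}

FONTDEF_CHOICES = {
    "weight": {"ultraL", "extraL", "light", "semiL", "medium",
               "semiB", "bold", "extraB", "ultraB"},
    "posture": {"roman", "italic"},
    "base": {"normal", "sub", "sup"},
}


def effective_fontdef(specified):
    """Return the full attribute set for a fontDef element.

    `specified` holds only the attributes written in the stream; the
    declared defaults are supplied for everything else.
    """
    if "id" not in specified:
        # id is declared #REQUIRED, so there is no default to fall back on.
        raise ValueError("fontDef requires an id attribute")
    unknown = set(specified) - set(FONTDEF_DEFAULTS) - {"id"}
    if unknown:
        raise ValueError("undeclared attributes: %s" % sorted(unknown))
    attrs = {**FONTDEF_DEFAULTS, **specified}
    for name, allowed in FONTDEF_CHOICES.items():
        if attrs[name] not in allowed:
            raise ValueError("%s=%r is not a declared choice" % (name, attrs[name]))
    return attrs
```

For example, a fontDef written with only an id and a weight would still report a size of 12 points and a roman posture.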



<!ELEMENT segment  - O  EMPTY >
<!ATTLIST segment
    id      ID      #REQUIRED
    x1      NUMBER  0
    dx1     NUMBER  0
    y1      NUMBER  0
    dy1     NUMBER  0
    x2      NUMBER  0
    dx2     NUMBER  0
    y2      NUMBER  0
    dy2     NUMBER  0
    thick   NUMBER  0
    dThick  NUMBER  0 >

FIG. 8
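
The summary notes that a higher level graphics recognizer could decide that four recognized lines form a box if their endpoints are coincident with high certainty. The sketch below, in Python, illustrates one way such a test might use the segment attributes. It assumes that dx1, dy1, dx2 and dy2 are symmetric tolerances on the corresponding coordinates, which is an interpretation rather than something the declaration states. It only checks that four segments close into a four-sided figure and does not test for right angles. The names Segment, coincident and closes_into_quadrilateral are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Subset of the segment attributes; d* values are treated as +/- tolerances."""
    id: str
    x1: int = 0
    dx1: int = 0
    y1: int = 0
    dy1: int = 0
    x2: int = 0
    dx2: int = 0
    y2: int = 0
    dy2: int = 0

    def endpoints(self):
        # Each endpoint is (coordinate x, tolerance on x, coordinate y, tolerance on y).
        return [(self.x1, self.dx1, self.y1, self.dy1),
                (self.x2, self.dx2, self.y2, self.dy2)]


def coincident(p, q):
    """Two endpoints coincide when their tolerance intervals overlap on both axes."""
    px, pdx, py, pdy = p
    qx, qdx, qy, qdy = q
    return abs(px - qx) <= pdx + qdx and abs(py - qy) <= pdy + qdy


def closes_into_quadrilateral(segments):
    """True when four segments chain end to end into one closed figure."""
    if len(segments) != 4:
        return False
    neighbours = {i: set() for i in range(4)}
    for i, seg in enumerate(segments):
        for p in seg.endpoints():
            # Every endpoint must touch exactly one endpoint of another segment.
            touching = [j for j, other in enumerate(segments) if j != i
                        for q in other.endpoints() if coincident(p, q)]
            if len(touching) != 1:
                return False
            neighbours[i].add(touching[0])
    # Each segment joined to two distinct others means a single four-cycle.
    return all(len(n) == 2 for n in neighbours.values())
```

Treating the recorded offsets as tolerances lets the test succeed even when the low level recognizer reports corner coordinates that differ by a unit or two.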
<!ELEMENT arc  - O  EMPTY >
<!ATTLIST arc
    id        ID      #REQUIRED
    x         NUMBER  0
    dx        NUMBER  0
    y         NUMBER  0
    dy        NUMBER  0
    r         NUMBER  0
    dr        NUMBER  0
    rShort    NUMBER  0
    drShort   NUMBER  0
    thick     NUMBER  0
    dThick    NUMBER  0
    theta0    NUMBER  0
    dTheta0   NUMBER  0
    theta1    NUMBER  0
    dTheta1   NUMBER  0
    theta2    NUMBER  0
    dTheta2   NUMBER  0 >

FIG. 9


<!ELEMENT image  - O  (#PCDATA)  --image file name-->
<!ATTLIST image
    id     ID      #REQUIRED
    x      NUMBER  0
    dx     NUMBER  0
    y      NUMBER  0
    dy     NUMBER  0
    w      NUMBER  0
    dw     NUMBER  0
    h      NUMBER  0
    dh     NUMBER  0
    resol  NUMBER  300 >

FIG. 10
<!ELEMENT spot  - O  (#PCDATA)  --hexadecimal value-->
<!ATTLIST spot
    id  ID      #REQUIRED
    x   NUMBER  0
    dx  NUMBER  0
    y   NUMBER  0
    dy  NUMBER  0
    bx  NUMBER  0
    by  NUMBER  0 >

FIG. 11



<!ELEMENT item    - O  EMPTY  --one element identifier-->
<!ELEMENT range   - O  EMPTY  --two element identifiers-->
<!ELEMENT altern  - O  ((item | range), (item | range)+)
                              --alternative sets of element identifiers-->
<!ATTLIST item
    r     IDREF  #REQUIRED >
<!ATTLIST range
    from  IDREF  #REQUIRED
    to    IDREF  #REQUIRED >

FIG. 12
<!ELEMENT tBlock  - O  (altern | item | range)+ >
<!ATTLIST tBlock
    id      ID      #REQUIRED
    x       NUMBER  0
    dx      NUMBER  0
    y       NUMBER  0
    dy      NUMBER  0
    w       NUMBER  0
    dw      NUMBER  0
    h       NUMBER  0
    dh      NUMBER  0
    x1      NUMBER  0   --abscissa of 1st char in block--
    dx1     NUMBER  0
    y1      NUMBER  0
    dy1     NUMBER  0
    interl  NUMBER  0   --interline--
    dir     (horiz | vertic)
                    horiz   --text flow direction--
    align   (left | center | right | justL | justC | justR)
                    left    --alignment-- >

FIG. 13



<!ELEMENT frame  - O  (altern | item | range)+ >
<!ATTLIST frame
    id  ID      #REQUIRED
    x   NUMBER  0   --abscissa--
    dx  NUMBER  0   --error on x--
    y   NUMBER  0   --ordinate--
    dy  NUMBER  0   --error on y--
    w   NUMBER  0   --width--
    dw  NUMBER  0   --error on w--
    h   NUMBER  0   --height--
    dh  NUMBER  0   --error on h-- >

FIG. 14
<!ELEMENT page  - O  (altern | item | range)+ >
<!ATTLIST page
    id  ID      #REQUIRED
    w   NUMBER  0   --width--
    h   NUMBER  0   --height-- >

FIG. 15


<!ELEMENT group  - O  (altern | item | range)+ >
<!ATTLIST group
    id  ID  #REQUIRED >

FIG. 16



<!DOCTYPE drStream [
<!ELEMENT drStream  - O  (page | frame | group | tBlock |
                          text | segment | arc |
                          fontDef | image | spot)+ >
<!ATTLIST drStream
    unit      (meter | point | inch)
                      inch  --measurement unit--
    fraction  NUMBER  1200  --fraction of measurement unit-- >

FIG. 17
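
The drStream attributes above default to a unit of inch and a fraction of 1200. The sketch below, in Python, shows one plausible reading: each NUMBER coordinate counts 1/fraction of the declared unit, so that a value of 1200 with the defaults is one inch. That reading is an assumption, as is taking one point to be 1/72 inch. The names POINTS_PER_UNIT and to_points are hypothetical.

```python
from fractions import Fraction

# Conversion factors to points, assuming 72 points per inch and 0.0254 m per inch.
POINTS_PER_UNIT = {
    "inch": Fraction(72),
    "point": Fraction(1),
    "meter": Fraction(720000, 254),
}


def to_points(value, unit="inch", fraction=1200):
    """Convert a DRstream coordinate to points.

    `value` is taken to be an integer count of 1/`fraction` of `unit`
    (an interpretation of the unit and fraction attributes).
    """
    if unit not in POINTS_PER_UNIT:
        raise ValueError("unit must be one of %s" % sorted(POINTS_PER_UNIT))
    if fraction <= 0:
        raise ValueError("fraction must be positive")
    return Fraction(value) * POINTS_PER_UNIT[unit] / fraction
```

Exact fractions are used so that converting between units does not accumulate rounding error before the recorded offsets are compared.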
<!DOCTYPE drStream [
<!ELEMENT drStream  - O  (page | frame | group | tBlock |
                          text | segment | arc |
                          fontDef | image | spot)+ >
<!ATTLIST drStream
    unit      (meter | point | inch)
                      inch  --measurement unit--
    fraction  NUMBER  1200  --fraction of measurement unit-- >

<!ELEMENT page  - O  (altern | item | range)+ >
<!ATTLIST page
    id  ID      #REQUIRED
    w   NUMBER  0   --width--
    h   NUMBER  0   --height-- >

<!ELEMENT frame  - O  (altern | item | range)+ >
<!ATTLIST frame
    id  ID      #REQUIRED
    x   NUMBER  0   --abscissa--
    dx  NUMBER  0   --error on x--
    y   NUMBER  0   --ordinate--
    dy  NUMBER  0   --error on y--
    w   NUMBER  0   --width--
    dw  NUMBER  0   --error on w--
    h   NUMBER  0   --height--
    dh  NUMBER  0   --error on h-- >

<!ELEMENT group  - O  (altern | item | range)+ >
<!ATTLIST group
    id  ID  #REQUIRED >

<!ELEMENT tBlock  - O  (altern | item | range)+ >
<!ATTLIST tBlock
    id  ID      #REQUIRED
    x   NUMBER  0
    dx  NUMBER  0
    y   NUMBER  0
    dy  NUMBER  0

FIG. 18A
    w       NUMBER  0
    dw      NUMBER  0
    h       NUMBER  0
    dh      NUMBER  0
    x1      NUMBER  0   --abscissa of 1st char in block--
    dx1     NUMBER  0
    y1      NUMBER  0
    dy1     NUMBER  0
    interl  NUMBER  0   --interline--
    dir     (horiz | vertic)
                    horiz   --text flow direction--
    align   (left | center | right | justL | justC | justR)
                    left    --alignment-- >

FIG. 18A cont.



<!ELEMENT item  - O  EMPTY  --one element identifier-->
<!ATTLIST item
    r     IDREF  #REQUIRED >

<!ELEMENT range  - O  EMPTY  --two element identifiers-->
<!ATTLIST range
    from  IDREF  #REQUIRED
    to    IDREF  #REQUIRED >

<!ELEMENT altern  - O  ((item | range), (item | range)+)
                       --alternative sets of element identifiers-- >

<!ELEMENT spot  - O  (#PCDATA)  --hexadecimal value-- >
<!ATTLIST spot
    id  ID      #REQUIRED
    x   NUMBER  0
    dx  NUMBER  0
    y   NUMBER  0
    dy  NUMBER  0
    bx  NUMBER  0
    by  NUMBER  0 >

FIG. 18B
<!ELEMENT image  - O  (#PCDATA)  --image file name-->
<!ATTLIST image
    id     ID      #REQUIRED
    x      NUMBER  0
    dx     NUMBER  0
    y      NUMBER  0
    dy     NUMBER  0
    w      NUMBER  0
    dw     NUMBER  0
    h      NUMBER  0
    dh     NUMBER  0
    resol  NUMBER  300 >

<!ELEMENT arc  - O  EMPTY >
<!ATTLIST arc
    id        ID      #REQUIRED
    x         NUMBER  0
    dx        NUMBER  0
    y         NUMBER  0
    dy        NUMBER  0
    r         NUMBER  0
    dr        NUMBER  0
    rShort    NUMBER  0
    drShort   NUMBER  0
    thick     NUMBER  0
    dThick    NUMBER  0
    theta0    NUMBER  0
    dTheta0   NUMBER  0
    theta1    NUMBER  0
    dTheta1   NUMBER  0
    theta2    NUMBER  0
    dTheta2   NUMBER  0 >

FIG. 18B cont.
<!ELEMENT segment  - O  EMPTY >
<!ATTLIST segment
    id      ID      #REQUIRED
    x1      NUMBER  0
    dx1     NUMBER  0
    y1      NUMBER  0
    dy1     NUMBER  0
    x2      NUMBER  0
    dx2     NUMBER  0
    y2      NUMBER  0
    dy2     NUMBER  0
    thick   NUMBER  0
    dThick  NUMBER  0 >

<!ELEMENT fontDef  - O  (#PCDATA)  --font family name-->
<!ATTLIST fontDef
    id       ID                   #REQUIRED
    size     NUMBER               12      --font size in points--
    weight   (ultraL | extraL | light |
              semiL | medium | semiB |
              bold | extraB | ultraB)
                                  medium  --font weight--
    posture  (roman | italic)     roman   --printing type style--
    base     (normal | sub | sup) normal
    under    NUMBER               0       --underline position if any-- >

<!ELEMENT text  - O  (s | aw | qc | qw)+ >
<!ATTLIST text
    id    ID     #REQUIRED
    font  IDREF  #CURRENT >

<!ELEMENT s   - O  (#PCDATA)  --char string-- >
<!ELEMENT aw  - O  (vw, vw+)  --alternate words-->
<!ELEMENT vw  - O  (#PCDATA)  --verified word-- >
<!ELEMENT qc  - O  (#PCDATA)  --questionable char-- >
<!ELEMENT qw  - O  (#PCDATA)  --questionable word-- >
]>

FIG. 18C
[Block diagram. Blocks, in order: SCANNER; SEGMENTER; STRUCTURE IMAGE
RECOGNIZER; CHARACTER RECOGNIZER; WORD RECOGNIZER (DICTIONARY);
SEMANTICS ANALYZER. Reference numerals shown: 100, 150, 200, 300, 400,
500.]

FIG. 19
[Flowchart. Boxes: SEGMENT BITMAP IMAGE INTO TEXT AND GRAPHICS;
TRANSFORM BITMAP GRAPHICS INTO CODED GRAPHICS DATA; TRANSFORM BITMAP
TEXT INTO CODED CHARACTER DATA; PLACE UNRECOGNIZED BITMAP IMAGES IN
IMAGE OR SPOT ELEMENTS; PLACE CODED GRAPHICS DATA INTO (SEGMENT OR ARC)
GRAPHICS ELEMENT; RECORD UNCERTAINTY INFORMATION; PLACE CHARACTER IN
QUESTIONABLE-CHARACTER-ELEMENT; PLACE CHARACTER IN
CHARACTER-STRING-ELEMENT. Step labels shown: S100, S110, S120, S160,
S170, S180, S185.]

FIG. 20
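
The text branch of the FIG. 20 flowchart places each recognized character either in a character-string-element or in a questionable-character-element, depending on its confidence. The sketch below, in Python, illustrates that decision. The function name encode_characters, the 0.9 threshold and the representation of recognizer output as (character, confidence) pairs are assumptions made for illustration; the element names s and qc and the </> end marker follow FIGS. 2 and 3 and the notation used in the description.

```python
def encode_characters(recognized, threshold=0.9):
    """Group recognized characters into <s> and <qc> elements.

    `recognized` is a sequence of (character, confidence) pairs. Runs of
    characters at or above `threshold` share one character-string-element;
    each character below it gets its own questionable-character-element.
    """
    parts = []
    run = []
    for ch, confidence in recognized:
        if confidence >= threshold:
            run.append(ch)
            continue
        if run:
            # Close the confident run before emitting the questionable character.
            parts.append("<s>" + "".join(run) + "</>")
            run = []
        parts.append("<qc>" + ch + "</>")
    if run:
        parts.append("<s>" + "".join(run) + "</>")
    return "".join(parts)
```

Keeping confident runs together means that a downstream word recognizer only has to examine the words that contain a qc element.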
[Flowchart. Boxes: SUBSTITUTE CHARACTERS FOR QUESTIONABLE
CHARACTER-ELEMENT IN WORD CONTAINING QUESTIONABLE-CHARACTER-ELEMENT;
RETURN QUESTIONABLE-CHARACTER-ELEMENT; PLACE VERIFIED WORD IN VERIFIED
WORD-ELEMENT; UPDATE UNCERTAINTY INFORMATION; PLACE MULTIPLE
VERIFIED-WORD ELEMENTS IN AN ALTERNATE-WORD ELEMENT. Branch label
shown: N. Step labels shown: S200, S220, S230, S240, S245.]

FIG. 21
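
The word recognizer procedure substitutes characters for each questionable character and checks the resulting words against a lexicon. The sketch below, in Python, illustrates the three outcomes named in the summary: one verified word, several alternate words, or none. The names resolve_word and encode_pieces, the lowercase alphabet used as the substitution set and the (text, is_questionable) representation are hypothetical choices for illustration only.

```python
from itertools import product

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def encode_pieces(pieces):
    """Re-emit a word unchanged as <s> and <qc> elements."""
    return "".join(("<qc>%s</>" if questionable else "<s>%s</>") % text
                   for text, questionable in pieces)


def resolve_word(pieces, lexicon, alphabet=ALPHABET):
    """Try to resolve the questionable characters of one word.

    `pieces` is a list of (text, is_questionable) pairs making up the word.
    Every questionable character is replaced by each letter of `alphabet`
    in turn and the resulting candidates are looked up in `lexicon`.
    """
    slots = [alphabet if questionable else [text]
             for text, questionable in pieces]
    candidates = {"".join(choice) for choice in product(*slots)}
    verified = sorted(candidates & set(lexicon))
    if not verified:
        # No lexicon entry matched: the questionable-character-element remains.
        return encode_pieces(pieces)
    if len(verified) == 1:
        return "<vw>%s</>" % verified[0]
    # Several matches: group the verified words in an alternate-word-element.
    return "<aw>" + "".join("<vw>%s</>" % word for word in verified) + "</>"
```

A recognizer that records alternative characters for each questionable character could pass those alternatives in place of the full alphabet, which narrows the search considerably.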
METHOD AND APPARATUS FOR CONVERTING BITMAP IMAGE DOCUMENTS TO EDITABLE
CODED DATA USING A STANDARD NOTATION TO RECORD DOCUMENT RECOGNITION
AMBIGUITIES

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to document recognition, and in
particular to methods and apparatus for recognizing textual and
graphics structures in documents originally represented as bitmap
images, and for recording the results of the recognition process.

2. Description of Related Art

Document recognition is the automatic transformation of paper documents
into editable electronic documents. It entails the gradual
transformation of bitmaps into structured components, through
successive and recursive interventions of various processes. These
processes include: page segmentation, character recognition, graphics
recognition, logical structure reconstruction, spelling correction,
semantic analysis, etc. All these processes are prone to
misinterpretation. Not all processes keep a record of the
misinterpretations they are aware of, and the ones that do keep a
record have no standard way of doing so. As a consequence, downstream
processes are generally not prepared to handle the record of
ambiguities handed to them by upstream processes, and simply discard
them. Valuable information is lost instead of being exploited for
automatic improvement of the document recognition function. If, on the
other hand, the ambiguity record is passed in its raw state to the
user, the chore of making manual corrections can quickly outweigh the
advantage of automatic recognition over a manual reconstruction of the
document.

U.S. Pat. Nos. 4,914,709 and 4,974,260 to Rudak disclose an apparatus
and method for identifying and correcting characters which cannot be
machine read. A bitmap video image of the unrecognized character(s) is
inserted in an ASCII data line of neighboring characters, thereby
allowing an operator to view the character(s) in question in context to
aid in proper identification of the character(s). Subsequently, with
the aid of the video image, the operator enters the correct
character(s) via a keyboard or other means. This apparatus and method
require operator interaction to clarify ambiguities resulting from an
automatic document recognition process. The results of these
ambiguities are not recorded in a notation that can be used by other
downstream automatic devices.

U.S. Pat. No. 4,907,285 to Nakano et al discloses an image recognition
system which uses a grammar for describing a document image, and parses
statements expressed by the grammar to recognize the structure of an
unknown input image. The grammar describes the image as substructures
and the relative relation between them. In the parsing process, after
the substructures and their relative relation are identified, a search
is made as to whether the substructures and their relative relation
exist in the unknown input image, and if they do, the inside of the
substructures are further resolved to continue the analysis. If the
substructures do not exist, other possibilities are searched and the
structure of the unknown input image is thus represented from the
result of the search. For example, the location of a rectangular region
of the document which contains a statement defined by the document
grammar (for example "TITLE" and "AUTHOR") is initially represented by
variables. See FIG. 10 of U.S. Pat. No. 4,907,285. After locating this
region in the document, the appropriate numeric values are substituted
for the variables.

U.S. Pat. No. 4,949,188 to Sato discloses an image processing apparatus
for synthesizing a character or graphic pattern represented by a page
description language and an original image. The image processing
apparatus generates a page description language including code data
which represents characters, graphics patterns, and the like, and
command data which causes a printer to print the original image.
Ambiguities from previous document recognition processes are not
recorded in the page description language. See, for example, the table
in column 4, lines 5-10. Accordingly, any downstream device receiving
the page description language cannot determine whether any ambiguities
occurred in the previously performed document recognition processes.

U.S. Pat. No. 4,654,875 to Srihari et al discloses a method of
automatic language recognition for optical character readers. Language
in the form of input strings or structures is analyzed on the basis of:
channel characteristics in the form of probabilities that a letter in
the input is a corruption of another letter; the probabilities of the
letter occurring serially with other recognized letters that precede
the letter being analyzed or particular strings of letters occurring
serially; and lexical information in the form of acceptable words
represented as a [illegible] structure. Ambiguities from upstream
recognition processes are not recorded.

"Word Association Norms, Mutual Information, and Lexicography", by
Kenneth W. Church and Patrick Hanks, Computational Linguistics, Vol.
16, No. 1 (March 1990), discloses a measure referred to as an
"association ratio" based on the information theoretic concept of
mutual information, for estimating word association norms from computer
readable corpora. This association ratio can be used by a semantics
analyzer to determine a most likely word from a choice of two or more
words that have been identified as possible words.

"On the Recognition of Printed Characters of Any Font and Size", by
Simon Kahan, Theo Pavlidis and Henry S. Baird, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 2 (March
1987), discloses a system that recognizes printed text of various fonts
and sizes for the Roman alphabet. Thinning and shape extraction are
performed directly on a graph of the run-length encoding of the binary
image. The resulting strokes and other shapes are mapped, using a
shape-clustering approach, into binary features which are then fed into
a statistical Bayesian classifier. This system identifies multiple
possible characters or words, and scores them. However, the uncertainty
in the recognition processes is not recorded using the standard
notation of the present invention.

In summary, a number of systems exist which can recognize graphics
structures, text (characters, words, semantics, fonts) and logical
structures (pages, paragraphs, footnotes), and which can determine the
uncertainty with which the recognized feature was recognized.
Accordingly, the above-identified patents and papers are incorporated
herein by reference. However, none of these systems record the results
of the recognition process (including uncertainties) in a manner which
can be used by other devices. This results in much information
(particularly regarding uncertainty) being lost, especially when
different recognition systems (e.g., character recognizers, word
recognizers, semantics analyzers) are used at different times (as
opposed to being integrated into one system).

OBJECTS AND SUMMARY OF THE INVENTION

It is an object of the present invention to provide methods and
apparatus for recording ambiguities in document recognition processes
in a standard format which can be used by a variety of document
recognizers.

It is another object of the present invention to provide methods and
apparatus for converting bitmap images into editable coded data,
wherein information regarding ambiguities in the transformation
processes performed by upstream recognizers can be recorded and thus
used by downstream, higher level recognizers which attempt to resolve
these ambiguities.

To achieve the foregoing and other objects, and to overcome the
shortcomings discussed above, methods and apparatus are provided for
converting documents represented as bitmap image data into editable
coded data, wherein a standard notation in a document description
language is utilized for recording document recognition ambiguities by
each document recognizer. When the results of document recognition
processes are recorded using this standard notation, any ambiguities
are identified in a uniform manner so that downstream, higher level
document recognition processes can attempt to resolve these ambiguities
by using all information about the ambiguities obtained by upstream
document recognition processes.

In particular, when using the standard notation of the present
invention, each document recognizer can record the results of its
recognition process in one or more elements, selectively identified
using the document description language. Each element includes a
type-identifier indicating a type of coded data (information) regarding
the recognized (transformed) bitmap image contained therein. Each
element also includes editable coded data therein of the type
identified by the type-identifier, and can also include uncertainty
information identifying any coded data which was not transformed with a
predetermined level of confidence. This uncertainty information is
determined by the document recognizer, and is recorded in a format that
is readable by higher level, downstream document recognizers. This
uncertainty information can include the level of confidence with which
the uncertain coded data was recognized by the document recognizer, in
order to further assist the higher level document recognizers in
resolving ambiguities. The uncertainty information can also include
alternative coded data for each uncertain recognition.

When the document recognizer is a character recognizer, any characters
which are not recognized with a predetermined level of confidence are
identified and recorded by placing them in
questionable-character-elements. The degree of certainty as well as
alternative possible characters and their degree of certainty can also
be recorded for each questionable character. Characters which were
recognized with at least the predetermined level of confidence are
placed in character-string-elements.

When the document recognizer includes a word recognizer (such as, for
example, a spelling checker), the word recognizer attempts to resolve
any existing questionable characters by determining whether any words
exist in a lexicon based upon each questionable character and the
certain characters in the word containing each questionable character.
If a word is identified in the lexicon for the word containing a
questionable character, that word is identified as a verified word, and
is recorded in a verified-word-element. If more than one verified word
is found, they are placed in individual verified-word-elements which
are collectively grouped together in an alternate-word-element. If no
verified words are found for the word containing a questionable
character, the questionable-character-element remains.

When the document recognizer includes a semantics analyzer, any
identified alternate verified words are resolved by analyzing the words
surrounding the alternate verified words. If one of the alternate
verified words can be confirmed with a predetermined level of
confidence based on the semantics analysis, it is returned and merged
with the surrounding character-string-elements. If the semantics
analyzer cannot determine which of the alternate verified words is
correct, it returns the alternate-word-element (and included
verified-word-elements) as such, and can include data indicative of the
probability that each verified word therein is the correct word.

When the document recognizer includes a graphics structure image
recognizer, it outputs graphics-elements containing coded data
representative of graphics structures recognized in the graphics image.
These structures can include: lines defined between endpoints; circles;
arcs; etc. Additionally, line thickness information can also be
returned and recorded. Ambiguities in the recognition process such as x
and y direction offsets and line thickness variations can also be
recorded. This data can be used by downstream higher level graphics
recognition processes to resolve any ambiguities, or to recognize more
complex graphics structures. For example, four lines recognized by a
low level graphics recognizer could be determined to be a box by a
higher level graphics recognizer if, for example, the endpoints can be
determined with a high degree of certainty to be coincident.

Additional image recognition elements are produced for recording
information relating to larger portions (or subimages) of the document
image. For example, data related to font, text blocks, frames, pages,
documents, and large and small pieces of unresolved bitmap images can
also be recorded.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be described in detail with reference to the
following drawings in which like reference numerals refer to like
elements, and wherein:

FIG. 1 is a sample page image used to illustrate the present invention;

FIG. 2 illustrates a character-string-element for collecting streams of
characters recognized with or above a predetermined confidence level;

FIG. 3 illustrates a questionable-character-element for collecting
questionable characters recognized with a low confidence level;

FIG. 4 illustrates a questionable-word-element for collecting a
questionable word which contains characters recognized with high
confidence, but which was not found in a lexicon;

FIG. 5 illustrates verified-word-elements for collecting verified words
found in a lexicon by resolving a
word containing one or more questionable characters, and an
alternate-word-element for collecting alternate words when two or more
verified words are found for one word containing questionable
characters;

FIG. 6 illustrates a text-element for collecting text elements having
the same font;

FIG. 7 illustrates a fontDef-element for collecting data relating to a
font type;

FIG. 8 illustrates one type of graphics-element which is a
segment-element for collecting data relating to a line segment;

FIG. 9 illustrates another type of graphics-element which is an
arc-element for collecting data relating to an arc;

FIG. 10 illustrates another type of graphics-element which is an
image-element for collecting data relating to a large unresolved bitmap
image;

FIG. 11 illustrates another type of graphics-element which is a
spot-element for collecting information relating to a small unresolved
bitmap image referred to as a spot, and for storing this information as
a hexadecimal value;

FIG. 12 illustrates examples of elements referring to other elements;

FIG. 13 illustrates a tBlock-element for collecting information
relating to blocks of text;

FIG. 14 illustrates a frame-element for collecting information relating
to frames which can include text blocks, images, spots, arcs and
segments, as well as other frames;

FIG. 15 illustrates a page-element for collecting data relating to a
page;

FIG. 16 illustrates a group-element for collecting information relating
to a group of elements which extend across page boundaries;

FIG. 17 illustrates a drStream-element for collecting data relating to
an entire document;

FIGS. 18A-C are a collection of all syntax necessary for describing a
document;

FIG. 19 is a block diagram of a system for inputting and converting a
bitmap image into coded data streams using the present invention;

FIG. 20 is a flowchart illustrating a procedure performed by the system
of FIG. 19 when using the present invention; and

FIG. 21 is a flowchart illustrating a procedure performed by the word
recognizer of FIG. 19 when using the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention utilizes a straightforward procedure for
recording ambiguities through the successive stages of the document
recognition process. These ambiguities are in the context of:

characters processed by character recognizers;

words processed by character recognizers, spelling checkers, and
semantics analyzers;

text flow processed by logical structure reconstructers; and

geometry of line segments and arcs processed by graphics recognizers.

Each of these processes produces and/or consumes a byte-oriented data
stream (hereinafter referred to as a document recognition stream or
DRstream), and bitmap streams (hereinafter referred to as image files),
referenced by the DRstream. The DRstream carries information about one
or several pages of a digitized document. The information describes
text with font, certain graphics primitives, and half tone images, as
well as their relationships, and the ambiguities about them.

The present invention does not provide any new document recognition
processes (or document recognizers) in the sense that it can be used
with existing recognizers which recognize, for example, characters or
graphics structures, or determine words (by comparing sequences of
characters against a lexicon of known words), or determine which word
from a choice of possible words is correct. However, the present
invention improves the efficiency and compatibility with which these
different types of recognizers function by providing a standard
notation for recording the results obtained by the recognizers in a
document description language.

FIGS. 2-18C illustrate this document recognition notation in ISO 8879
Standard Generalized Mark-up Language (SGML), according to the Document
Type Definition discussed below. According to the present invention,
each recognizer records coded data, corresponding to the results of the
recognition process which it performs, as coded information, referred
to in SGML as elements. Each element contains coded data which has been
recognized as being similar in some way (for example: text, graphics,
same page, all certain characters, etc.). Each element includes: a) a
type-identifier which indicates the type of coded data contained in
that element; b) an optional identification number, unique amongst all
similar type elements of a document, which distinguishes that element
from other similar type elements so that an element can be referenced
by other elements (most elements will have an identification number);
c) coded data obtained by the document recognition process (this could
be strings of characters or parameters defining graphics structures);
and d) optional contents (referred to as attributes) for providing
additional information (for example, uncertainty information) about the
coded data included in that element. Although the attributes of an
element can be used to record uncertainty information about coded data
in an element (information such as, for example, levels of confidence
with which the coded data was recognized, or possible offsets for
parameters (e.g. endpoints defining a line segment) of a graphics
structure), the type-identification in some cases also serves to convey
uncertainty information by indicating that the contents of that element
was determined with a level of confidence below a predetermined level
of confidence. In the illustrated examples, the coded data is recorded
as human readable ASCII, however other codes could also be used.

One familiar with SGML will understand the generic contents of the
elements to be described below. Thus, only a brief discussion of a
generic element will be provided with reference to FIGS. 18A-C. Then,
each type of element will be specifically described with reference to
FIGS. 2-17. FIGS. 18A-C illustrate a complete syntax of elements which
can be used to describe a document according to the present invention.
This list of elements would be located at the start of each DRstream,
and would be used by conventional parsers, programmed to parse streams
written in SGML, to parse the DRstream contained therebelow. That is,
after the syntax list of elements, a continuous stream of elements
describing a specific document would be provided. As used herein, the
terminology "continuous
stream of elements" refers to a group of elements which are identified as belonging together. Thus, in a markup language such as SGML, where white-space is permitted (and, in fact, encouraged for readability), tabs, breakage into separate lines constitute white-space that the parser will ignore. In this sense, white-space is part of the continuous stream of elements. Other systems may have a limit on the size of character streams. In these systems, long DRstreams would be split across several files which would be identified as belonging together. Such a DRstream, where several files are identified as belonging together, is also intended to be [illegible] attribute (to be described below) which also would be listed at the start of the DRstream.) Of course, all elements listed in FIGS. 18A-C are not required to record the results of a document recognition process; however, when more elements are provided, more information can be recorded.

Referring to FIG. 2, in SGML: the terminology "<!ELEMENT s" means "define an element whose type is 's'"; the terminology "- O" means "the element begins whenever its type identifier appears bracketed < >, the element ends with </> (element-end marker), or when another element begins at the same or a higher level in the nesting structure"; and "(#PCDATA)" means "the contents of this element is a character string". Thus, FIG. 2 defines an element containing a character string (such as "horse") which would be recorded as follows:

<s>horse</s>; or

<s>horse</>; or

<s>horse

Other possible contents of an element can be other elements (for example, the aw element of FIG. 5 which includes two or more vw elements as its contents), or only attributes (represented by EMPTY and an attribute list; see FIG. 8). The terminology "+" indicates that the immediately preceding item can be repeated. These definitions will become more clear as each element is defined in more detail below.

FIG. 1 is a sample page image which will be used to illustrate the types of bitmap images which can be transformed and recorded, and their form of recordation, using the present invention. The sample image includes various interesting features, such as: characters hard to recognize because of their poor shape or poor quality; structured graphics in the form of two line segments; bitmap graphics in the form of some undefined drawing; logical structure in the form of footnote and its callout character.

FIG. 2 illustrates a character-string-element (s) into which a character recognizer collects characters that meet the following conditions:

all characters have been recognized with a high confidence level (having at least a predetermined level of confidence);

all characters have the same font, baseline position and underlining status; and

there is no significant white gap between each character (for instance, characters that are horizontally aligned but belong to two columns of text, separated by a certain amount of white space, are not put together in the same element).

The illustrated type-identifier is "s". Character-string-elements do not have id numbers, but instead can be placed in larger elements.

With reference to the FIG. 1 image, a portion of that image having a series of characters recognized with at least a predetermined level of confidence by a character recognizer would be recorded using the present invention implemented in SGML as follows:

<s>Etymologies appear in square brackets [ ] following </>

<s> the "definitions". In accordance with the </>

FIG. 3 illustrates a questionable-character-element (qc) [illegible] correctly recognized. Existing character recognizers currently determine a level of confidence for each character. If a character is not recognized with at least a predetermined level of confidence, these character recognizers somehow tag the character. However, bringing an uncertain character to the attention of the user is another matter. Some vendors have an interactive package where recognizing and asking a user for guidance are intertwined; it is not known whether these systems tag uncertain characters as such, because it is an internal matter, and the uncertainty is lifted right away by user intervention. Other vendors merely tag the uncertain characters, say with a pair of question marks, creating the problem that the next process down the line cannot distinguish these question marks from genuine ones. However, questionable characters are not recorded in a manner that can be used by other machines. (That is, question marks and highlighting may have some other meaning.) Thus, when this data is passed to a higher level device such as a spelling checker, the spelling checker will not be able to utilize the information that the character was not recognized with a high degree of certainty.

In the present invention, a higher level device receives the information that a character was not recognized with a high degree of certainty since every character located in a questionable-character-element contains that characteristic. Thus, by using a notation in a document description language to record ambiguities, other recognizers can utilize uncertainty information.

Preferably, each qc element carries one questionable character. The qc element could also contain a list of alternate characters if the character recognizer identifies more than one possible character below the predetermined confidence level for a particular portion of a bitmap image. Additionally, the degree of certainty for the one or plurality of questionable characters can also be provided in each qc element. Ideally, questionable-character-elements are subsequently eliminated by a spelling checker.

For example, the system described in the above incorporated paper by Simon Kahan et al could be used to generate alternate characters (or words), each having some type of measure indicative of the level of confidence associated with that character (or word). However, unlike the system disclosed by Kahan et al, this information relating to characters and/or words would be recorded in appropriate, distinct elements using a document description language according to the present invention. This would enable other, higher level document recognizing processes (which may be separate from and used at a time separate from the Kahan et al system) to access this information in a uniform way.
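The element notation described above (a start tag, an optional end tag, and the short element-end marker </>) can be illustrated with a small reader. The following is only a minimal sketch, not a conforming SGML parser: the function name `parse_flat_elements` is hypothetical, and it assumes a flat run of attribute-free elements such as s and qc, with no nesting.

```python
import re

# Matches a start tag (<s>, <s >), an end tag (</s>), the short
# element-end marker (</>), or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)?\s*>|([^<]+)")


def parse_flat_elements(stream):
    """Return (type, contents) pairs for a flat run of attribute-free elements.

    Assumption for illustration: elements are not nested and carry no
    attributes. Leading and trailing white-space of the contents is dropped.
    """
    elements = []
    open_type, open_text = None, ""
    for match in TOKEN.finditer(stream):
        slash, name, text = match.groups()
        if text is not None:
            # Character data outside any element (e.g. line breaks) is ignored.
            if open_type is not None:
                open_text += text
            continue
        if open_type is not None:
            # An end tag, or a new start tag, ends the element still open.
            elements.append((open_type, open_text.strip()))
            open_type, open_text = None, ""
        if not slash and name:
            open_type = name
    if open_type is not None:
        # The end of the stream also ends an open element.
        elements.append((open_type, open_text.strip()))
    return elements
```

Under these assumptions the three recordings of "horse" yield the same result, because the end tag may be written in full, abbreviated, or omitted.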
The present invention also permits existing recognizers to operate in a more efficient manner. For example, by distinguishing between certain and uncertain characters (or words), more complicated and time consuming recognition procedures can be limited to the uncertain characters (or words).

FIG. 4 illustrates a questionable-word-element (qw) into which a word recognizer (e.g., spelling checker) places words that contain letters recognized with a high level of confidence, but which are not found in the lexicon of the word recognizer. There is one questionable word per qw element. These questionable words can be resolved by other word recognizers which include different lexicons, or by some other means (such as a semantics analyzer), to be described below.

With reference to FIG. 1, suppose all the characters in the word "Jumblatt" were confidently recognized, but the spell checker of the word recognizer did not find the word "Jumblatt" in its lexicon. It would be recorded in a qw element as follows:

<qw>Jumblatt</>

FIG. 5 illustrates a verified-word-element (vw) and an alternate-word-element (aw) into which a word recognizer places words which are found in its attempt to eliminate questionable-character-elements. The word recognizer looks for words in a lexicon for each occurrence of a questionable character based upon the word associated with a questionable-character-element. If a word is found in its lexicon, the word recognizer places that word in a vw element. When the word recognizer tries to eliminate questionable characters, it may find several words, verified in its lexicon. If the word recognizer cannot decide between the verified words, it places each of them in a vw element and places the set of vw elements in an aw element for the benefit of a downstream process such as the semantics analyzer. The semantics analyzer would then attempt to determine which of the verified words is correct by analyzing the words surrounding each occurrence of alternate words.

The word recognizer could use various conventional processes for selecting words to be compared with a lexicon. For example, every letter of the alphabet could be substituted for the questionable-character-element in the word containing that questionable-character-element, and these results searched in the lexicon. If alternate questionable characters are provided in a questionable-character-element, the substitution could be limited to only the alternate questionable characters. If no verified words are found, the questionable-character-element would remain, and optionally, uncertainty information contained therein could be updated by the word recognizer.

Consider, for example, the two strings and questionable character from FIG. 1 illustrated below found by a character recognizer:

<s>, the origins of numerous English w</s>

<qc>a</qc>

<s>rds are still obscure</s>

the word recognizer, trying to reduce the questionable "a", finds "wards" and "words" as candidates and replaces the above notation by:

<s>, the origins of numerous English</s>

<aw><vw>wards</vw> <vw>words</vw></aw>

<s>are still obscure </s>

The stream of elements could be supplied to a semantics analyzer which would attempt to determine which word was correct. If the semantics analyzer can determine which word is correct, it merges that word into the surrounding s-elements. For example, assume the following data is provided to the semantics analyzer:

<s>, the origins of numerous English </s>

<aw>

<vw>wards</>

<vw>words</>

</aw>

<s>are still obscure.</>

The semantics analyzer determines that "words", not "wards", is the correct choice. It can replace the above notation by any of the choices below (it does not really matter which choice is selected, however the first choice is the most logical and the second choice is the most expedient):

<s>, the origins of numerous English words are still obscure. </>

<s>, the origins of numerous English <s>words<s> are still obscure. </>

<s>, the origins of numerous English words <s> are still obscure. </>

<s>, the origins of numerous English <s> words are still obscure. </>

It should be noted that the intermediate </>s have been omitted since they are optional.

FIG. 6 illustrates a text-element which is used to collect character data (s, aw, qc and qw elements) of the same font. A text element has an id attribute, allowing it to be referenced by higher elements, and an optional reference to a font identifier (defined below). If the font reference is not supplied, the most recently supplied one is used. The text-elements are produced by character recognizers that can discern different fonts. An example of data recorded in a text-element is as follows:

<text id=[illegible] font=2>list of s, aw, qc and qw elements</>

FIG. 7 illustrates a fontDef element. Type faces analyzed by the character recognition process are recorded in fontDef elements with as much information as possible. The contents of a fontDef element is the font family name, if the character recognizer is able to derive it. [illegible] that font name cannot be derived, the contents is left empty; it could be filled in by a downstream process or interactively by a user.

The id-attribute enables text elements to reference font descriptions. The size-attribute is measured in points. The base-attribute indicates whether the base line is offset by superscripting or subscripting. If there is underlining, the under-attribute indicates the position of the underline below the base line of the font. An example of data recorded in a fontDef-element where the font family name is Frutiger is as follows:

<fontDef id=2 size=10 under=1>Frutiger</>
Note that the attributes are recorded in the first set of
brackets < >.
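The rule that a text-element without a font reference uses the most recently supplied one can be sketched as follows. This is an illustration only: representing each text-element as a Python dictionary, and the name `resolve_fonts`, are assumptions made for the example and are not part of the notation itself.

```python
def resolve_fonts(text_elements):
    """Fill in the font reference of each text-element, in stream order.

    Each element is a dict of attributes (a hypothetical representation).
    An element that omits "font" inherits the most recently supplied one;
    the value stays None if no font has been supplied yet.
    """
    resolved = []
    current_font = None
    for element in text_elements:
        if "font" in element:
            current_font = element["font"]
        # Copy the element so the caller's data is left untouched.
        resolved.append({**element, "font": current_font})
    return resolved
```

For example, four text-elements whose font attributes are 2, absent, 5 and absent resolve to fonts 2, 2, 5 and 5.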

FIG. 8 illustrates a segment-element which is one
type of graphics-element. Segment-elements are used by
the graphics recognizer to note line segments it recognizes
from the bitmap image. The id-attribute enables
higher elements to reference the segment-element. The
coordinates of the ending points (x1, y1 and x2, y2),
relative to the top left corner of the page, define the
segment mathematically. The uncertainty about the
exact ending point coordinates is recorded in the dx1,
dy1 and dx2, dy2-attributes. Thus, dx1, dy1, dx2 and
dy2 record possible offsets of the parameters (x1, y1, x2,
y2) used to describe the line segment graphics structure.
The segment thickness and its uncertainty are noted by
the thick and dThick-attributes. An example of data
recorded in a segment-element is provided below:

<segment id=14 x1=2100 dx1=5 y1=1440 x2=2100 dx2=5 y2=2160 thick=[illegible]></>

As with the fontDef-element, the attributes are provided
within the first set of brackets. Since the segment-element
does not contain any character strings (its content
is EMPTY), the first set of brackets is followed by
an element-end marker </> or, since element-end
markers are not required, by a new element.
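To illustrate how a downstream process might read these attributes, the sketch below parses the numeric attributes of a start tag and turns each coordinate and its recorded offset into an interval. It rests on stated assumptions: the offset is treated as symmetric (value minus offset to value plus offset), a missing offset is treated as zero, and the names `parse_numeric_attributes` and `coordinate_bounds` as well as the sample tag values are illustrative.

```python
import re

# name=value pairs with integer values, as in a segment start tag
ATTRIBUTE = re.compile(r"(\w+)\s*=\s*([-+]?\d+)")


def parse_numeric_attributes(start_tag):
    """Return a dict of the integer-valued attributes found in a start tag."""
    return {name: int(value) for name, value in ATTRIBUTE.findall(start_tag)}


def coordinate_bounds(attributes, names=("x1", "y1", "x2", "y2")):
    """Map each coordinate to (low, high) using its d-prefixed offset.

    Assumption: the offset applies equally in both directions, and an
    absent offset attribute means no recorded uncertainty.
    """
    bounds = {}
    for name in names:
        offset = attributes.get("d" + name, 0)
        bounds[name] = (attributes[name] - offset, attributes[name] + offset)
    return bounds


# Illustrative values only.
tag = "<segment id=14 x1=2100 dx1=5 y1=1440 x2=2100 dx2=5 y2=2160>"
segment = parse_numeric_attributes(tag)
segment_bounds = coordinate_bounds(segment)
```

With these sample values, x1 is bounded by 2095 and 2105, while y1, which has no offset in the tag, is bounded by 1440 on both sides.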

FIG. 9 illustrates an arc-element, which is another
type of graphics-element. Arc-elements are used to note
circles, circular arcs, ellipses and elliptical arcs recognized
from the bitmap image by the graphics recognizer.
The id-attributes enable higher level elements to
reference the arc. The other attributes are:

x, y, dx, dy: coordinates and uncertainty at the
center of the circle, elliptical arc, measured from
the top left corner of the page;
r, dr: length and uncertainty of the radius of arc of
a circle, or long axis of the arc of an ellipse;
rShort, drshort: length and its imprecision of the
short radius of the arc of an ellipse;
theta1, dTheta1: angle between the vertical axis
and the line passing through the center and one
of the end points of the arc. This attribute is
present for arcs only. The angle can be measured
in milliradians;
theta2, dTheta2: same as theta1, dTheta1 for the
other endpoint;
theta0, dTheta0: angle between the vertical axis
and the long axis of an ellipse. This attribute is
present for ellipses and elliptical arcs only;
thick, dThick: thickness and uncertainty of the arc,
circle or ellipse.
An example of data recorded in an arc-element is as 
follows: 

<arc id=5462 x=2300 dx=8 y=1440 dy=8 r=2100 dr=15></>

FIG. 10 illustrates an image-element which is a third
type of graphics-element. The image element is used to
denote a rectangular area of the page that has not been
resolved as text or structured graphics, and is therefore
left in bitmap form in a separate file. The image element
contains the name of the file.

The image element attributes encode the position and
uncertainty relative to the top left corner of the page (x,
dx, y, dy) and the dimensions (w, dw, h, dh) of the
image. The resol-attribute is expressed in bits per unit of
measurement (the unit of measurement is supplied by
the drStream element, defined later).

At the onset of the document recognition operation,
the DRstream usually contains only image elements,
one per digitized page of the paper document. Gradually,
as character strings, line segments and arcs are
extracted (using conventional techniques), the bitmaps
are replaced by smaller and perhaps more numerous
ones. At the completion of the operation, the only bitmaps
left are the genuine half tone images and the portions
of the document that the character recognizer and
graphics recognizer could not decipher.

A bitmap image stored in a file named "Squiggle"
would be recorded as follows:

<image id=567 x=1840 y=1680 w=260 h=480>Squiggle</>

FIG. 11 illustrates a spot-element, which is a fourth
type of graphics-element. Spot-elements contain small
images and denote a very small rectangular area left in
bitmap format: unrecognized small smudges, dingbats,
unknown symbols, etc. The bitmap is small enough that
its bitmap can be encoded conveniently in hexadecimal
form as the contents of the spot-element, rather than
carried in a separate file.

The x, dx, y and dy-attributes supply the position of
the spot with respect to the top-left corner of the page.
The bx-attribute gives the number of bits in the horizontal
direction; it is constrained to be a multiple of eight.
The by-attribute gives the number of 1-bit high rows.
When a spot element needs to be imaged, the hexadecimal
value is consumed 8*bx bits (2*bx hexadecimal
characters) at a time for each row. The hexadecimal
value contains trailing 0 bits where appropriate.
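The row-by-row consumption of the hexadecimal value can be sketched as below. The sketch follows the consumption rule as literally stated, taking 2*bx hexadecimal characters per row; reading bx that way is an assumption, since the same passage also describes bx as a number of bits. The function name `decode_spot_rows` and the sample value are hypothetical.

```python
def decode_spot_rows(hex_value, bx, by):
    """Split a spot's hexadecimal contents into `by` rows of bits.

    Assumption: each row takes 2*bx hexadecimal characters (8*bx bits),
    following the stated consumption rule. White-space in the value is
    ignored. Each row is returned as a string of '0' and '1' characters.
    """
    digits = "".join(hex_value.split())
    chars_per_row = 2 * bx
    if len(digits) < chars_per_row * by:
        raise ValueError("hexadecimal value is too short for bx and by")
    rows = []
    for row in range(by):
        chunk = digits[row * chars_per_row:(row + 1) * chars_per_row]
        # Each hexadecimal character expands to four bits.
        rows.append("".join(format(int(digit, 16), "04b") for digit in chunk))
    return rows
```

For instance, the made-up value FF81 with bx=1 and by=2 yields the rows 11111111 and 10000001.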

Suppose that the bullet • in the FIG. 1 sample page
has not been recognized. It would be noted as a small
image as follows:

<spot id=11 x=390 y=850 bx=25 by=25>03FFB000 . . . </>

FIG. 12 illustrates references to other elements. The
text, segment, arc, image and spot-elements may be
grouped together by higher-level elements (text blocks,
frames and pages, discussed below), via a reference to
their identifier. A reference to a single element is made
by an item-element, the single attribute of which has the
value of the identifier of the referenced element.
A reference to a consecutive succession of elements is
made by a range-element. "From" and "to" attributes
refer to the identifiers of the first and last referenced
elements. "First" and "last" are relative to the chronological
order in which the elements are found in the
DRstream. A range-element is a short-hand notation for
an unbroken succession of item-elements.
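The equivalence between a range-element and an unbroken succession of item-elements can be shown with a short sketch. It assumes the identifiers of the referenceable elements are available as a list in the order in which they appear in the DRstream; the function name `expand_range` and the sample identifiers are illustrative.

```python
def expand_range(stream_ids, first_id, last_id):
    """Expand a range reference into the identifiers it stands for.

    `stream_ids` lists element identifiers in the order they are found in
    the DRstream (an assumed representation). The result is the unbroken
    run from `first_id` to `last_id`, inclusive.
    """
    start = stream_ids.index(first_id)
    end = stream_ids.index(last_id)
    if end < start:
        raise ValueError("the last element precedes the first in the stream")
    return stream_ids[start:end + 1]
```

Note that the order is positional, not numerical: with identifiers recorded in the order 11, 14, 5462, 567, a range from 14 to 567 stands for items 14, 5462 and 567.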

Ambiguities about grouping are denoted by altern-elements.
Alternative groupings are used by processes
to encode a number of reasonable element groupings.
For instance, a page of text has been recognized as
having four text blocks, two on the left side and two on
the right; the logical structure processor (or logical
reconstructer), unable to determine if the text reads as
two columns or as two rows, groups them in the order
top left, bottom left, top right, bottom right; or the
order top left, top right, bottom left, bottom right; a
downstream process like a syntactic analyzer might be
able to resolve the ambiguity.
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FIG. 13 illustrates a tBlock demeat tBlock-elements elements, such as the graphics-elements described 

encode rectangular areas forming an invisible boundary above, using a document description language. That is, 

around a text line or a set of equally spaced text lines. unrecognized bitmap images are placed in unresolved- 

The location of a tBlock, relative to the top left comer graphics-image-type elements (S170): image-elements if 

of the page, and their imprecision are given by the x, y, 5 they are large, or spot-elements and represented as hex- 

dx and dy-attributes. The dimensions and uncertainties adedmal values if they are small. If the graphics image 

are recorded by w,h,dw, and dh. The internal-attribute subimages are recognized and transformed into 

measures the interval between the equally-spaced lines coded graphics data, they are placed in one or more 

within the block; its value is zero when the text block segment-elements and/or arc-elements (S180). Addi- 

contains one line only. The xl, dxl, yl, dyl-attributes lo ^^^^ information regarding uncertainty (for example, 

give the location of the first character in the text block, -^^^ ^^^^^ parameters (coded data) describ- 

™® ^ ^ f!™' ? ^ w • , , ing the graphics structures optionally can be recorded 

FIG. 14 mustrates a frame^ement A frame-dem^t i^^t^ese clients (S185). 

encodes a rectanguUr area, smalkr than or equal to the transfonnmg graphics bitmap images 

page area. It is used to aggr^te tot b^l^ m^gfis^ 15 i^^o editable coded data usmg the present invention, the 

spots,arcsandse^ent.aswellasotherframes.Frames ^^^^ re^gni^ 200 acts as a first 

"^nS^'Is m^tes^^Jage-elemeat A page-element transformation means fonperformmg a first transforma- 

aggregates all the pieced of information about a digi- operauon on the bitmap graphics nnage to trans- 

tizid page ofa document Ifthere is no page-element in form the graphics bitmap miage into one or more gra^^ 

a DRkh^ it is assumed that aU the DRstream daU ^^s elements contaimng coded data definmg graphics 

belongs to a smgle page. structure^ and as a first identification means usmg the 

FIG. 16 illustrates a group-element A group element document description language for identifying tiie one 

enables a collection of tiie elements across page bound- or more graphics elements transformed by the first 

aries. It may be used by the logical structure recon- 25 transformation means, each graphics-element including 

structer and the semantic analyzer to indicate the flow an element type identifier indicating a type of coded 

of text across pages. ^ta regarding the recognized bitmap image contained 

FIG. 17 illustrates a drStream element At the top of in that element When the first transformation means 

the document type definition, is the drStream element determmes that the coded data contained in the gr^h- 

Its unit-attribute gives the name of the measurement ^ ics-element has not been transformed with a predeter- 

unit used throughout the drStream. The fraction-attrib- mined level of confidence, the identification means also 

ute indicates what firaction of the measurement unit the includes in the graphics-element uncertainty informa- 

coordinates, dimensions, and their imprecisioiis actually tion (offsets) determined by the first transformation 

represent For example, if the measures are in microns, means regarding the coded data contained in each 

the drStream element attributes are: graphics element 

The character recognizer 300 transforms the bitmap 

drStream uiut«nicter fi:action=ioooooo> textual image (or subimages) into coded character data 

^ , , ^ . (S120) which is then stored in the appropriate element 

FIGS. 18A-C ilhistrates all of the elements used m S140 or S150 (character-string or questionable- 

the disclosed page descnption language. character) as described above. In order to determine 

FIG. 19 illustrates a document re^jgmUon sy^ ^ ^^^^^ character data in a character- 

useable with die present mventoon. FIGS. 20 and 21 ^ string-element or a questionable-charactcr-dement, a 

flowcharts illustratmg procedures for operating the jetenmnation is made in S130 as to whether a recog- 

FIG. 19 system accordmg to the present mv^on. In ^ ^j^^^ recognized with at least a predeter- 

ordertomput abitmapmiage (SIOOX a paper document j^^j of confidence. Although the insertion of a 

IS scamied usmg an character mto a questionable^haracter-element serves 

bitmap docmnent "^^S^^^J^J^^^^ to convey micertlty information about that character, 

scamimg proc^ can be performed essendaUy at the information such as alternate possible uncer- 

same tnne that the recogmtion processes are performed, uxikjxuuu^vu o^h^ r • * «vl 

or the bitmap documenVhnage 110 can be s^lied on ^ ^^^^^ ^"^"^ ° n^ftlT t nL^^^ 
son^ty^d^tromosiot^f^mcdimnsnohZ^h^d 50 characters can f«l^^^^^f/^^^ 
or floppV disk. The bitmap docum«it image 10 is sup- character-element (S155). Thus, the character recog- 
pHedto a conventional Jegmenter 150 (SlO) whiS mzer 300 will produce a sueam of characto-st^^ 
iegmentsthebitmapimageintosmallersubmiages,such ments and qu^onable-c^racter-elemenU, which can 
as, for example textual subimages containing only text, tha. be supphed to a word recogn^r 400, 
and graphics subimages containing only graphics. The 55 The word recognizer 4O0 mclud^ a dictionary or 
segmenter 150 can iterativdy segment the bitmap image of words tiierem. The word recognizer 400 
into smaller subimages until each subimage is recog- operating according to the present mven^. would 
nizcd as containing only text or only graphics. The then perform the procedure illustrated m FIG. 20 for 
graphics subimages are then supplied to a structure each questionable-character-element Fu^t. m S200. a 
image recognizer (or graphics recognizer) 200, while 60 pluraHty of characters are sequentially substituted for 
the textual subimages are supplied to a character recog- the questionable-character-element in the word contain- 
nizer 300. Of course, if it is known in advance that the ing the questionable-character-element. In S210, a de- 
bitmap document image contains only text or graphics, termination is made as to whether any of the words 
it can be supplied directiy to the structure image recog- formed by the substituting step (S200) were found m the 
nizer 200 or character recognizer 300. 65 lexicon of the word recognizer 400. Such words are 

The structure image recognizer 200 then transforms referred to as "verified words'*. If no verified words are 

the bitmap graphics image (or subimages) into coded found, the questionable-character-element is returned in 

graphics data (S160) which can be recorded in graphics- S240, and optionally, in S245, the uncertainty infonna- 
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tion contained in the questionable-charactcr-element is ous changes may be made without departing from the 

updated based upon any determinations made by the spirit and scope of the invention as defined in the fol- 

word recognizer 400. If the determination in S210 is lowing claims, 

positive, each verified word is placed in a verified- What is claimed is: 

word-clement (S220). Next, in S230, if more than one 5 1. A method of transforming a document represented 

verified-word-dements are produced from a single as a bitmap image into an editable coded data stream 

questionable-character-elementy the multiple verified- defined using a machine readable document description 

word-elements are placed in an altemate-word-element. language that records coded information resulting from 

Each altemate-word-element can be transformed into the document transformation process and information 

a character-string-element by a semantics analyzer 500 10 regarding uncertainties in the document transformation 

which attempts to determine which of the verified process, comprising: 

words in an altemate-word-element is correct based performing a first transformation operation on at least 

upon surrounding words. If the semantics analyzer can- a text portion of said bitmap image using a charac- 

not deterarine which of the verified words in an alter- ter recognition apparatus, to transform at least said 

nate-word-dement are correct, it returns the alternate- 15 text portion of said bitmap image into coded infor- 

word-element, and optionally provides uncertainty in- mation recognized with a level of confidence; 

formation for each of the verified words in each vcri- outputting and recording said coded information into 

fied-word-dcment therein. one or more elements that are defined using said 

Thus, when transforming textual bitmap images into document description language, each element hav- 
editable coded data using the present invention, the 20 ing a machine readable element type identifier that 
character recognizer 300 acts as a first transformation indicates the type of said coded information re- 
means for performing a first transformation <^ration garding the recognized bitmap image recorded in 
on the textual bitmap image to transform the textual said element so that the type of coded information 
bitmap image into one or more elements containing contained in each element can be known without 
coded character data; and as a first identification means 25 examining the coded information contained in each 
using the document description language for identifying element, the element type identifier for each ele- 
the one or more elements transformed by the first trans- ment having been defined based on the type of 
formation means, each element incltiding an element coded information recorded in the element and the 
type identifier indicating a type of coded character data level of confidence with which said bitmap image 
regarding the recognized bitmap textual image con- 30 represented by said coded information was recog- 
tained in the element Elements containing characters nized so that each of said elements is selectively 
not recognized with a predetermined level of confi- identified, each element having coded information 
dence are recorded in elements identified by the first of a single type recorded therein; and 
identification means as questionable-character-ele- when said character recognition apparatus deter- 
ments, while certain characters are recorded in ele- 35 mines that the recognized bitmap image contained 
ments identified as character-string-elements. in an element has not been recognized with at least 

The word recognizer 400 acts as a second transforma- a predetermined level of confidence, recording m 
tion means for transforming each questionable-charac- said element uncertainty information determined 
tcr-clcment and adjacent confidently recognized char- by said first recognition apparatus regarding said 
acteis in a same word as the questionable-character-de- 40 recognized bitmap image contained in said ele- 
ment into one or more verified words by substituting meni; 

alternate characters for the questionable-character-ele- wherdn said element type identifier is a character- 

ment and verifying that a word resulting firom the sub- string-element or a questionablc-character-ele- 

stitution exists in a lexicon; and as a second identifica- ment, each character-string-element containing a 

tion means using the document description language for 45 string of consecutive characters recognized by said 

placing each verified word in a vcrified-word-dement. character recognition ^paratus with at least said 

When more than one verified-word-elements are created for a
questionable-character-element, the second identification means
also places the more than one verified-word-elements in an
alternate-word-element. The second identification means maintains
the questionable-character-element when no verified words are
determined to exist.

The alternate-word-element can then be supplied to semantics
analyzer 500 which acts as a means for determining which verified
word within an alternate-word-element is a correct verified word
based on words surrounding the alternate-word-element; and as a
third identification means for identifying the correct verified
word, and for replacing the alternate-word-element with a
character-string-element containing the correct verified word.

While this invention has been described in conjunction with
specific embodiments thereof, it is evident that many alternatives,
modifications and variations will be apparent to those skilled in
the art. Accordingly, the preferred embodiments of the invention as
set forth herein are intended to be illustrative, not limiting. Vari-

predetermined level of confidence, each
questionable-character-element containing said uncertainty
information determined by said character recognition apparatus for
a character which was not recognized with at least said
predetermined level of confidence by said character recognition
apparatus.

2. The method of claim 1, wherein said uncertainty information
includes a degree of uncertainty with which said character
recognition apparatus transformed said bitmap image.

3. The method of claim 1, further comprising:
for each questionable-character-element, using a word recognizer
to transform said questionable-character-element and adjacent
confidently recognized characters in a same word as said
questionable-character-element into one or more elements having
an element type identifier defined as a verified-word-element, by
substituting alternate characters for said
questionable-character-element when one or more words created by
said substituting are recognized by said word recognition
apparatus; when more than one verified-word-element is
transformed for each questionable-character-element, said more
than one verified-word-elements being placed in an element having
an element type identifier defined as an alternate-word-element,
said questionable-character-element remaining when no verified
words are recognized by said word recognition apparatus.
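
The substitution step recited in claim 3 can be illustrated with a minimal Python sketch. It is an illustration only, not the patented implementation: the class names, the `recognize_word` function, the plain set used as a lexicon, and the case-insensitive lookup are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class QuestionableChar:
    """Stand-in for a questionable-character-element (hypothetical class)."""
    candidates: List[str]  # candidate characters, most likely first


@dataclass
class VerifiedWord:
    """Stand-in for a verified-word-element (hypothetical class)."""
    word: str


@dataclass
class AlternateWords:
    """Stand-in for an alternate-word-element (hypothetical class)."""
    words: List[VerifiedWord]


def recognize_word(
    prefix: str, qc: QuestionableChar, suffix: str, lexicon: Set[str]
) -> Union[QuestionableChar, VerifiedWord, AlternateWords]:
    """Substitute each candidate character and keep words found in the lexicon."""
    verified = [
        VerifiedWord(prefix + ch + suffix)
        for ch in qc.candidates
        if (prefix + ch + suffix).lower() in lexicon  # assumed case-insensitive lookup
    ]
    if not verified:
        return qc  # no verified word: the questionable-character-element remains
    if len(verified) == 1:
        return verified[0]  # exactly one verified-word-element
    return AlternateWords(verified)  # several: wrap them in one alternate-word-element


LEXICON = {"cat", "cot", "cut"}  # toy lexicon, assumed for the example
result = recognize_word("c", QuestionableChar(["a", "o", "e"]), "t", LEXICON)
```

Returning the original element unchanged when nothing verifies mirrors the "remaining" language of the claim, so later stages still see the recorded uncertainty.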

4. The method of claim 3, further comprising:
for each alternate-word-element, using a semantics analyzer to
transform verified words of the verified-word-elements contained
in each alternate word element into an element identified as a
character-string-element corresponding to one of the verified
words contained in said alternate-word-element when said
semantics analyzer determines that said one of said verified
words is a correct word, said alternate-word-element remaining
when none of said verified words is determined to be a correct
word by said semantics analyzer.

5. The method of claim 4 wherein for each verified-word-element
contained in each alternate word element, said semantics analyzer
determines and identifies a confidence level of each verified word
contained in each verified-word-element, and indicates the
confidence level of each verified-word-element when none of the
verified words in an alternate-word-element are determined to be a
correct word by said semantics analyzer.
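
The behavior recited in claims 4 and 5 can be sketched as follows. The claims state only that the semantics analyzer uses surrounding words; the bigram-count scoring, the 0.8 threshold, and the names `analyze_alternates` and `bigram_counts` are assumptions chosen to keep the example small. The tags "s" and "aw" follow the element names shown in the figures.

```python
from typing import Dict, List, Tuple, Union

Resolved = Tuple[str, str]                         # ("s", word)
Unresolved = Tuple[str, List[Tuple[str, float]]]   # ("aw", [(word, confidence), ...])


def analyze_alternates(
    alternates: List[str],
    previous_word: str,
    bigram_counts: Dict[Tuple[str, str], int],
    threshold: float = 0.8,
) -> Union[Resolved, Unresolved]:
    """Pick the correct word from context, or keep all alternates with confidences."""
    # Assumed scoring: share of bigram counts (previous_word, candidate).
    counts = [bigram_counts.get((previous_word, w), 0) for w in alternates]
    total = sum(counts)
    confidences = [c / total if total else 0.0 for c in counts]
    best = max(range(len(alternates)), key=lambda i: confidences[i])
    if confidences[best] >= threshold:
        # One word is judged correct: replace with a character-string-element.
        return ("s", alternates[best])
    # None is judged correct: the alternate-word-element remains, annotated
    # with a confidence level for each verified word.
    return ("aw", list(zip(alternates, confidences)))


BIGRAMS = {("the", "cat"): 9, ("the", "cot"): 1}  # toy counts, assumed
resolved = analyze_alternates(["cat", "cot"], "the", BIGRAMS)
unresolved = analyze_alternates(["cat", "cot"], "the", BIGRAMS, threshold=0.95)
```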

6. The method of claim 1, wherein for each
questionable-character-element, said uncertainty information
pertaining to a character not recognized with at least said
predetermined level of confidence includes a most likely uncertain
character identified by said character recognition apparatus.

7. The method of claim 6, wherein for each
questionable-character-element, said uncertainty information
pertaining to a character not recognized with at least said
predetermined level of confidence also includes a degree of
confidence determined by said character recognition apparatus for
said most likely uncertain character.

8. The method of claim 1, wherein for each
questionable-character-element, said uncertainty information
pertaining to a character not recognized with at least said
predetermined level of confidence includes alternate possible
uncertain characters identified by said character recognition
apparatus.

9. The method of claim 8, wherein for each
questionable-character-element, said uncertainty information
pertaining to a character not recognized with at least said
predetermined level of confidence also includes a degree of
confidence determined by said character recognition apparatus for
each said alternate possible uncertain character.
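
The uncertainty information enumerated in claims 6 through 9 can be pictured as a small record. The sketch below is one possible in-memory layout, not a format defined by the patent: the class name, field names, and the `candidates` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QuestionableCharElement:
    """Hypothetical record for the contents of a questionable-character-element."""
    most_likely: str   # most likely uncertain character (claim 6)
    confidence: float  # degree of confidence for that character (claim 7)
    # Alternate possible characters (claim 8), each with its own degree of
    # confidence (claim 9).
    alternates: List[Tuple[str, float]] = field(default_factory=list)

    def candidates(self) -> List[str]:
        """Return all candidate characters, highest confidence first."""
        ranked = [(self.most_likely, self.confidence)] + self.alternates
        return [ch for ch, _ in sorted(ranked, key=lambda p: p[1], reverse=True)]


qc = QuestionableCharElement("a", 0.55, [("o", 0.30), ("e", 0.10)])
ordered = qc.candidates()
```

Keeping the alternates with their confidences lets a later word recognizer try substitutions in order of likelihood.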

10. A method of transforming a document represented as a bitmap
image into an editable coded data stream defined using a machine
readable document description language that records coded
information resulting from the document transformation process and
information regarding uncertainties in the document transformation
process, comprising the steps of:

a) analyzing text portions of said bitmap image using a character
recognizer to transform said text portions of said bitmap image
into said coded information recognized with a level of
confidence;

b) outputting and recording said coded information into a series
of elements that are defined using said document description
language, each element having a machine readable element type
identifier that indicates the type of said coded information
regarding the recognized bitmap image recorded in said element so
that the type of coded information contained in each element can
be known without examining the coded information contained in
each element;

c) the element type identifier for each element having been
defined using said document description language based on the
type of coded information recorded in the elements by defining
the type identifier of each of said elements as
character-string-elements or questionable-character-elements,
each element defined as a character-string-element containing a
string of consecutive characters recognized by said character
recognizer with at least a predetermined level of confidence,
each element defined as a questionable-character-element
containing information pertaining to a character not recognized
with at least said predetermined level of confidence, said
character-string-elements and said questionable-character-elements
being arranged as a continuous stream of elements based on an
order of the characters in said bitmap image, each element having
coded information of a single type recorded therein;

d) for each element defined as a questionable-character-element,
analyzing the coded information pertaining to the character
contained therein and adjacent confidently recognized characters
contained in a same word as said questionable-character-element
using a word recognizer, to transform and record said coded
information and adjacent confidently recognized characters into
one or more elements having type identifiers defined as
verified-word-elements by substituting alternate characters for
said questionable-character-element and determining whether one
or more verified words created by said substituting are
recognized by said word recognizer; and

e) when more than one verified-word-elements are transformed for
each questionable-character-element, placing said more than one
elements defined as verified-word-elements in an element having
an element type identifier defined as an alternate-word-element,
said questionable-character-element remaining when no verified
words are recognized by said word recognizer.

11. The method of claim 10, further comprising:

f) for each alternate-word-element, analyzing the verified words
of the verified-word-elements contained therein and surrounding
words using a semantics analyzer, to transform said
alternate-word-element into an element identified as a
character-string-element corresponding to one of verified words
contained in said alternate-word-element when said semantics
analyzer determines that one of said verified words is a correct
word, said alternate-word-element remaining when none of said
verified words is determined to be a correct word by said
semantics analyzer.

12. The method of claim 11, wherein for each verified-word-element
contained in each alternate-word-element, said semantics analyzer
determines and identifies a confidence level of each verified word
contained in each verified-word-element, and indicates the
confidence level for each verified-word-element when none of the
verified words in an alternate-word-element are determined to be a
correct word by said semantics analyzer.
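
Step c) of claim 10, arranging character-string-elements and questionable-character-elements as one continuous stream in reading order, can be sketched as below. The tuple layout, the 0.9 threshold, and the function name `build_element_stream` are assumptions; the tags "s" and "qc" follow the element names shown in the figures.

```python
from typing import List, Tuple, Union

Element = Union[Tuple[str, str], Tuple[str, str, float]]


def build_element_stream(
    recognized: List[Tuple[str, float]], threshold: float = 0.9
) -> List[Element]:
    """Group (character, confidence) pairs into a stream of typed elements.

    Consecutive confident characters are merged into one ("s", text)
    element; each low-confidence character becomes its own
    ("qc", char, confidence) element. Reading order is preserved.
    """
    stream: List[Element] = []
    run: List[str] = []  # confident characters collected so far
    for ch, conf in recognized:
        if conf >= threshold:
            run.append(ch)
        else:
            if run:  # close the current character string first
                stream.append(("s", "".join(run)))
                run = []
            stream.append(("qc", ch, conf))
    if run:  # flush the trailing character string
        stream.append(("s", "".join(run)))
    return stream


chars = [("c", 0.99), ("a", 0.5), ("t", 0.97), (" ", 0.99), ("s", 0.98)]
stream = build_element_stream(chars)
```

Because each element carries its type tag, a consumer can tell confident text from uncertain characters without inspecting the contents, which is the property step b) calls for.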

13. The method of claim 10, wherein for each questionable
character element, said information pertaining to a character not
recognized with at least said predetermined level of confidence
includes a most likely uncertain character identified by said
character recognizer.

14. The method of claim 13, wherein for each
questionable-character-element, said information pertaining to a
character not recognized with at least said predetermined level of
confidence also includes a degree of confidence determined by said
character recognizer for said most likely uncertain character.

15. The method of claim 10, wherein for each
questionable-character-element, said information pertaining to a
character not recognized with at least said predetermined level of
confidence includes alternate possible uncertain characters
identified by said character recognizer.

16. The method of claim 15, wherein for each
questionable-character-element, said information pertaining to a
character not recognized with at least said predetermined level of
confidence also includes a degree of confidence determined by said
character recognizer for each said alternate possible uncertain
character.

17. An automatic document recognition apparatus for transforming
text documents represented as bitmap image data into an editable
coded data stream defined using a machine readable document
description language that records coded data resulting from the
document transformation process and information regarding
uncertainties in the document transformation process, said
apparatus comprising:

a character recognizer having:

a) first transformation means for performing a character
recognition operation on said bitmap image representation of said
document to transform said document into coded character data
recognized with a level of confidence; and

b) first identification means for outputting and recording said
coded character data into one or more elements that are defined
using said document description language, said first
identification means further defining a machine readable element
type identifier for each element, said element type identifier
indicating the type of said coded character data regarding the
recognized bitmap image recorded in said element so that the type
of coded character data contained in each element can be known
without examining the coded character data contained in each
element, said identification means selectively identifying said
one or more elements by defining the element type identifier for
each element based on the type of coded data recorded in the
element and the level of confidence with which said bitmap image
was recognized, each element having coded character data of a
single type recorded therein, and, when said first transformation
means determines that the coded character data contained in the
element has not been transformed with a predetermined level of
confidence, said identification means also recording in said
element uncertainty information determined by said first
transformation means regarding said coded character data
contained in said element;

wherein said element type identifier is a character-string-element
or a questionable-character-element, each character-string-element
containing a string of consecutive characters recognized by said
character recognition apparatus with at least said predetermined
level of confidence, each questionable-character-element
containing said uncertainty information determined by said
character recognition apparatus for a character which was not
recognized with at least said predetermined level of confidence by
said character recognition apparatus.

18. The apparatus of claim 17, wherein said uncertainty
information includes a confidence level with which said first
transformation means determined said coded data.

19. The apparatus of claim 17, further comprising:
a word recognizer having:

i) second transformation means for transforming each
questionable-character-element and adjacent confidently
recognized characters in a same word as the
questionable-character-element into one or more verified words by
substituting alternate characters for the
questionable-character-element and verifying that a word
resulting from said substituting exists in a lexicon; and

ii) second identification means using said document description
language for placing each verified word in an element having an
element type identifier defined as a verified-word-element, and
when more than one verified-word-elements are created for a
questionable-character-element, placing said more than one
verified-word-elements in an element having an element type
identifier defined as an alternate-word-element, said
questionable-character-element remaining when no verified words
are determined to exist.

20. The apparatus of claim 19, further comprising:
a semantics analyzer including:

1) means for determining which verified word within an
alternate-word-element is a correct verified word based on words
surrounding the alternate-word-element; and

2) third identification means for identifying said correct
verified word and for replacing said alternate-word-element with
an element identified as a character-string-element containing
said correct verified word.

* * * * *




14. The media editing system of claim 13 wherein the 
alternate media object display means are operative to display 
the identifiers to the user in a display area that is adjacent to 
a timeline that displays information about the composition. 

15. The media editing system of claim 13 wherein the 
alternate media object display means are responsive to user 
actuation of a portion of a timeline that displays information 
about the composition on a display that is responsive to the 
media editing system, and wherein the portion of the time- 
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line is at a position in the timeline that corresponds to the one 
of the positions in the composition that the one of the media 
objects in the machine-readable composition occupies. 
16. The media editing system of claim 15 wherein the
alternate media object display means are operative to display
the identifiers to the user in a display area that is adjacent to
the portion of the timeline.

* * * * *
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